The problem to be considered here is that of
identifying, or of classifying, an observed
individual as being a member of one of two
"populations." This problem arises in some form
in most sciences. A recent example is the problem,
associated with certain international tensions,
of classifying salmon caught in the North
Pacific fishery as having arisen from the Asiatic
or American salmon populations.

The populations are to be considered as giving
rise to observable individuals each of which
may be (partially) characterized by a set of
k measurements. The measurements of individuals
from either population are distributed as
if they were independent observations on a
multivariate distribution of probability. These
distributions are assumed to be multivariate
normal, with known parameters, for each
population.

1. Statement of the Problem 

When an individual is misclassified, there may
or may not be loss functions associated with the
misclassification. For the problems of this paper
explicit results are not obtainable for general
loss functions; we shall assume loss functions to
be constants. Let us designate as $\alpha$ the loss
associated with misclassification of an individual
from population I and as $\beta$ the loss associated
with misclassification of an individual from
population II; $\alpha, \beta > 0$. Also, there is the
question of whether or not anything is known about
the mixed population from which the individual
to be classified is drawn; in particular, whether
or not there are known a priori probabilities,
under a random drawing, that an individual
belongs to either of the parent populations. Let us
designate the prior probabilities as $p$ for
population I and $q = 1 - p$ for population II.

It follows that there are four levels of the 
classificatory problem to be considered: 

(1.1) (a) with loss functions and prior probabilities

(1.2) (b) with prior probabilities only 

(1.3) (c) with loss functions only 

(1.4) (d) with neither 

Misclassifications are undesirable; however,
there are no adequate common units in which
the "undesirability" can be measured for all of
the above levels. At each level there are two
quantities for which some form of joint
minimization is desired, viz.:

(1.5) (a) $\alpha p P_I$, $\beta q P_{II}$

(1.6) (b) $p P_I$, $q P_{II}$

(1.7) (c) $\alpha P_I$, $\beta P_{II}$

(1.8) (d) $P_I$, $P_{II}$

where $P_I$ is the probability that a random
individual of population I is classified as having
arisen from II, and $P_{II}$ is the probability that
a random individual of II is classified as having
arisen from I.

These four pairs of quantities will be referred 
to indiscriminately as "error quantities.” 

Now either error quantity of a pair may be
reduced to zero, but not both jointly. Thus, joint
minimization of the error quantities is, to a
certain extent, arbitrary. While various
specifications of joint minimization can be formulated,
the more reasonable are those which have
already been proposed elsewhere in the literature,
viz.:

(i) joint minimization may be specified as
that which minimizes the sum of error
quantities; let us denote this criterion as
"minisum";

(ii) joint minimization may be specified as
that which minimizes the larger of the
error quantities; let us denote this
criterion as "minimax."
The first of these was introduced on level (a) by
Brown (1950) and the second introduced on
level (b) by Welch (1939). There has been
more recent work on discriminant analysis, some
of which is at levels similar to this treatment,
but little seems applicable as the risk functions
are not well defined.

Each of these specifications leads to the choice
of one out of a family of quadratic discriminators.
However, there are two related major
difficulties: one is the determination of which
member of the family is appropriate (for the
minimax solution), and the other is that the
integrals giving $P_I$ and $P_{II}$ cannot be evaluated
explicitly (for either solution), and no tables
are available for the resulting $P_I$ and $P_{II}$.

If the variance-covariance matrices of the two
populations are equal, the quadratic discriminator
reduces to a linear discriminator; the
integrals for $P_I$ and $P_{II}$ may then be reduced to
the incomplete integral of the standard normal
density. This reduction of the integrals holds
for any linear discriminator.

If we let A be a row vector of direction numbers,
X be a row vector of variables (representing
the possible measurements on the
individual), c be a constant, and let primes denote
transposition, then a linear discriminator may
be written:

(1.9) $AX' = c$.

We lose no generality if we number the
populations such that the individual represented by X
is classified into population I if $AX' < c$ and
into population II if $AX' \geq c$.

Let $(m_1, \sigma_1^2)$, $(m_2, \sigma_2^2)$ be the mean and
variance of AX' when X is distributed as in
populations I, II, respectively. Then it follows,
using an obvious notation, that:

(1.10) $P_I = \int_{-\infty}^{(m_1 - c)/\sigma_1} N(0, 1)\, dx$

(1.11) $P_{II} = \int_{-\infty}^{(c - m_2)/\sigma_2} N(0, 1)\, dx$
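
As a numerical illustration of (1.10) and (1.11), the two probabilities may be evaluated from the error function. The following Python sketch is not part of the original paper; the function names are hypothetical, and the rule assumed is the one stated above (population I when AX' < c, population II otherwise).

```python
import math


def std_normal_cdf(z):
    """Integral of the standard normal density from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def error_probabilities(c, m1, s1, m2, s2):
    """Return (P_I, P_II) for the linear rule with constant c.

    m1, s1: mean and standard deviation of AX' in population I.
    m2, s2: mean and standard deviation of AX' in population II.
    """
    # An individual of I is misclassified when AX' falls at or above c.
    p_one = std_normal_cdf((m1 - c) / s1)
    # An individual of II is misclassified when AX' falls below c.
    p_two = std_normal_cdf((c - m2) / s2)
    return p_one, p_two
```

With m_1 = 0, m_2 = 1, unit standard deviations and c = 0.5, both probabilities equal the standard normal integral up to -0.5, about 0.309.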


2. The Appropriate Linear Function

For the case when the distributions have
identical variance-covariance matrices, the vector
A is well known (see, for example, Fisher,
1936), being the inverse of this common matrix
multiplied by the vector of differences of the means.
When the variance-covariance matrices are not
equal but are proportionate, then the corresponding
A (using either of the matrices) is still
optimum under both the minisum and minimax
criteria.
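
For two measurements the computation of A, and of the means and standard deviations of AX', may be sketched as follows. This Python fragment is an illustration only: the helper names are hypothetical, the 2 x 2 inverse is written out by hand, and the difference of means is taken as (mean of II) minus (mean of I), an assumed sign convention chosen so that m_2 > m_1.

```python
def fisher_direction_2d(cov, mean_one, mean_two):
    """Inverse of the common 2 x 2 matrix applied to (mean_two - mean_one)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("dispersion matrix is singular")
    dx = mean_two[0] - mean_one[0]
    dy = mean_two[1] - mean_one[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return [(d * dx - b * dy) / det, (a * dy - c * dx) / det]


def projected_mean_and_sd(A, mean, cov):
    """Mean and standard deviation of AX' for one population."""
    m = A[0] * mean[0] + A[1] * mean[1]
    var = sum(A[i] * cov[i][j] * A[j] for i in range(2) for j in range(2))
    return m, var ** 0.5


# Hypothetical example with proportionate matrices: cov_two = 4 * cov_one.
cov_one = [[2.0, 0.0], [0.0, 1.0]]
cov_two = [[8.0, 0.0], [0.0, 4.0]]
A = fisher_direction_2d(cov_one, [0.0, 0.0], [2.0, 1.0])
m1, s1 = projected_mean_and_sd(A, [0.0, 0.0], cov_one)
m2, s2 = projected_mean_and_sd(A, [2.0, 1.0], cov_two)
```

In this made-up example the projected standard deviation of population II is twice that of population I, which is the situation treated in the remainder of the paper.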

In many fields the assumption of proportionate
but not necessarily equal variance-covariance
matrices is not unreasonable. This
situation occurs, for example, in marine biology.
The Hawaiian tunas ahi (Neothunnus macropterus)
and ahipahala (Thunnus alalunga) are
similar in most respects, but the ahi is a larger
and more complex fish. If weight, fork length,
lengths of second dorsal and anal fins, and the
ratio of the length of the pectoral fin to the
fork length (which varies inversely as the first
four variables) are taken to be the variables, the
population variance-covariance matrices for the
ahi and ahipahala are (expected to be)
proportional but unequal. Another example is cited in
the literature, although only two variables were
used. Mottley (1941) found that the variances
and covariance for head and body measurements
of trout (Salmo gairdnerii kamloops) stocked
in two Canadian lakes were proportional.

The optimum A for general dispersion matrices
is not easy to derive. This problem is
considered in another paper by the authors (1960).
The current paper considers optimum c for
given A and thus in what follows it is only
necessary to consider that A has been determined
either by the methods mentioned above
or arbitrarily.

3. The Constant c for Minimized Error Quantities

We lose no generality if we let $m_2 > m_1$
and $\sigma_2 \geq \sigma_1$. The designation of the population
having the larger standard deviation as
population II is arbitrary. We may then make a scale
transformation of $\pm 1$, whichever is necessary
to obtain $m_2 > m_1$.

We now wish to obtain expressions for the
constant c which will minimize the error
quantities under the minisum and minimax criteria,
respectively.
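Consider $\alpha p P_I + \beta q P_{II}$.

(3.1) $\frac{\partial(\alpha p P_I + \beta q P_{II})}{\partial c} = -\frac{\alpha p}{\sigma_1 \sqrt{2\pi}} \exp\left[-\frac{(c - m_1)^2}{2\sigma_1^2}\right] + \frac{\beta q}{\sigma_2 \sqrt{2\pi}} \exp\left[-\frac{(c - m_2)^2}{2\sigma_2^2}\right]$

Equating the derivative to zero and rearranging,
we obtain

(3.2) $\frac{(c - m_1)^2}{\sigma_1^2} - \frac{(c - m_2)^2}{\sigma_2^2} - 2 \ln \frac{\alpha p \sigma_2}{\beta q \sigma_1} = 0$

which is a quadratic in c with minisum c as
roots:

(3.3) $c(ms) = \frac{m_1 \sigma_2^2 - m_2 \sigma_1^2 \pm \sigma_1 \sigma_2 \left[(m_2 - m_1)^2 + 2(\sigma_2^2 - \sigma_1^2) \ln \frac{\alpha p \sigma_2}{\beta q \sigma_1}\right]^{1/2}}{\sigma_2^2 - \sigma_1^2}$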

Equation (3.3) has three possibilities:

(1) when there are no real roots,

(2) when no roots fall in $(m_1, m_2)$, and

(3) when one and only one root falls in $(m_1, m_2)$.

If a root should fall at one of $m_1$, $m_2$, this may
be considered as a limiting case of situation
(2). Situation (1) is trivial; all individuals are
classified into one population. In situation (2),
linear discrimination is not very helpful;
quadratic discrimination is indicated. In these
situations, possibly (depending on parameters) there
is no discrimination which will be much of an
improvement over the classification of all
individuals into one population or a purely
random classification. Thus, situation (3) will be
considered in this paper.

When a root falls in $(m_1, m_2)$, this is the
root which minimizes $\alpha p P_I + \beta q P_{II}$, and is
therefore the root desired. The other root
maximizes $\alpha p P_I + \beta q P_{II}$ and therefore will not be
used. Since $\sigma_2$ has been arranged to be greater
than $\sigma_1$, and the smaller root is less than $m_1$,
the positive square root is required. When
$\sigma_1 = \sigma_2$, c(ms) is the root in $(m_1, m_2)$; the other
root is infinite.
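
The choice of root may be illustrated numerically. The Python sketch below is an assumption-laden illustration rather than the authors' procedure: the function name is hypothetical, the positive-root form of (3.3) used is written out in the comments, and the branch for equal standard deviations is the limiting form obtained by setting $\sigma_1 = \sigma_2$ in the condition that the two weighted densities are equal at c.

```python
import math


def c_minisum(m1, s1, m2, s2, alpha=1.0, beta=1.0, p=1.0, q=1.0):
    """Minisum constant, positive square root; None if there is no real root.

    Solves alpha*p*f1(c) = beta*q*f2(c), where f1 and f2 are the normal
    densities of AX' in populations I and II. Assumes m2 > m1, s2 >= s1.
    """
    log_term = math.log((alpha * p * s2) / (beta * q * s1))
    if s1 == s2:
        # Limiting case: the quadratic in c degenerates to a linear equation.
        return 0.5 * (m1 + m2) + s1 * s1 * log_term / (m2 - m1)
    disc = (m2 - m1) ** 2 + 2.0 * (s2 * s2 - s1 * s1) * log_term
    if disc < 0.0:
        return None  # situation (1): no real roots
    # c = [m1*s2^2 - m2*s1^2 + s1*s2*sqrt(disc)] / (s2^2 - s1^2)
    numerator = m1 * s2 * s2 - m2 * s1 * s1 + s1 * s2 * math.sqrt(disc)
    return numerator / (s2 * s2 - s1 * s1)
```

The default weights of 1 correspond to the case in which neither losses nor prior probabilities are used. The caller still has to check whether the returned value lies between m_1 and m_2, since only then is the linear discriminator of interest.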

Consider now the minimizing of max ($\alpha p P_I$,
$\beta q P_{II}$). $\alpha p P_I$ and $\beta q P_{II}$ are monotonic,
decreasing and increasing respectively, in c; and,
therefore, the desired c is located such that
$\alpha p P_I = \beta q P_{II}$. An explicit result will not be found in
general, since the integrals have not been
evaluated explicitly. If $\alpha p = \beta q$, we have the
integrals identical except for upper limits of
integration, and $\alpha p P_I = \beta q P_{II}$ reduces to

(3.4) $\frac{m_1 - c}{\sigma_1} = \frac{c - m_2}{\sigma_2}$

Solving, we obtain a minimax c:

(3.5) $c(mx) = \frac{m_1 \sigma_2 + m_2 \sigma_1}{\sigma_1 + \sigma_2}$

It should be noted that if $\sigma_1 = \sigma_2$ and
$\alpha p = \beta q$, both c(ms) and c(mx) reduce to a c
dependent upon only the centroids,

(3.6) $c(m) = \frac{m_1 + m_2}{2}$

This c(m) is the population analogue of the 
c introduced for samples by Barnard (1935) 
and Fisher (1936) and currently used in linear 
discriminant analysis. 
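
The minimax constant can also be found numerically when $\alpha p$ and $\beta q$ differ, since the weighted error quantities are monotonic in c. The Python sketch below is offered only as an illustration: the function names, the weights w1 (standing for $\alpha p$) and w2 (standing for $\beta q$), and the use of bisection are all assumptions, not the authors' method.

```python
import math


def std_normal_cdf(z):
    """Integral of the standard normal density from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def c_minimax_closed_form(m1, s1, m2, s2):
    """Equation (3.5); valid when alpha*p equals beta*q."""
    return (m1 * s2 + m2 * s1) / (s1 + s2)


def c_midpoint(m1, m2):
    """The constant c(m), dependent only upon the centroids."""
    return 0.5 * (m1 + m2)


def c_minimax_numeric(m1, s1, m2, s2, w1=1.0, w2=1.0, iterations=200):
    """Bisection for w1 * P_I(c) = w2 * P_II(c)."""

    def gap(c):
        # Decreasing in c: positive for small c, negative for large c.
        p_one = std_normal_cdf((m1 - c) / s1)
        p_two = std_normal_cdf((c - m2) / s2)
        return w1 * p_one - w2 * p_two

    spread = 40.0 * max(s1, s2)
    lo, hi = min(m1, m2) - spread, max(m1, m2) + spread
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With equal weights the numerical root agrees with (3.5); with equal standard deviations as well, (3.5) coincides with the midpoint c(m).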

4. A Discussion of Levels and c’s 

The results (3.3) and (3.5) apply for the
case in which loss functions and prior
probabilities are known, i.e., (1.1). When either or
both of these quantities are unknown,
corresponding to (1.2), (1.3), or (1.4), the
corresponding error quantities considered are given
by (1.6), (1.7), or (1.8) respectively. The
results corresponding to (3.3) and (3.5) are
obtained readily by the following substitutions in
(3.3) and (3.5):

(1.2) "prior probabilities only": $\alpha = \beta = 1$

(1.3) "loss functions only": $p = q = 1$

(1.4) "neither": $\alpha = \beta = p = q = 1$.

For level (a), where both prior probabilities 
and loss functions are known, the risk may be 
measured and specified. If the total risk is to 
be minimized, then c(ms) is the appropriate 
constant. If the risk is to be minimized, subject 
to the restriction that risks from each source 
are to be equal, then c(mx) is the appropriate 
constant. 

For level (b), where prior probabilities only
are known, then c(ms) minimizes the
conditional probability of misclassification. However,
if classification is only part of the problem at
hand, then it may be desirable, in order to avoid
bias in later stages, say, to minimize, subject
to equalizing the probabilities of the two types
of misclassification; here c(mx) is the
appropriate constant.

For example, consider a merchandising
situation. If the problem is to allocate a limited
shipment of goods to two branches of the same
store, the same management suffers the loss
from understocking either branch, and c(ms)
is the appropriate constant to use in specifying
the quantities of goods to go to each branch.
On the other hand, if the problem is to equalize
buyer-seller risk, as in the case of an
independent mediator handling quality control, then
c(mx) is the appropriate constant to use in
specifying the acceptable level of quality.

For levels (c) and (d), the error quantities 
are in no sense absolute quantities. Here c(mx) 
will be the most reasonable constant to use, 
since under the minimax solution the expected 
numbers of misclassifications are equal for the 
two populations. 

In practice, $\alpha$, $\beta$, $p$, and $q$ may or may not be
well defined conceptually, but either way will
often, perhaps usually, be unknown. Thus a
comparison between discriminators using c(ms),
c(mx), and c(m) at level (d) is appropriate.

5. Comparison of Discriminators

Introduction. The discriminators may be 
compared on the basis of our minisum and 
minimax criteria. Let us designate these criteria 
respectively in terms of the error quantities as 

(i) $P_s = P_I + P_{II}$

(ii) $P_x = \max(P_I, P_{II})$.

In comparing two discriminators, either one
of them has both criteria less than or equal to
those of the other, or neither does.
If the former holds, then the discriminator with
the smaller criteria may be said to be better
than the other. This is true whether the
discrimination is linear or not.

For the purposes of this paper, A has been
taken to be a vector of constants. Thus, while
linear discriminators are functions of both A
and c, our comparison need be concerned only
with varying c's. The restriction to level (d)
together with the vector of constants, A, enables
us to keep the number of parameters down to
two for comparisons of the discriminators
AX' = c(ms), AX' = c(mx), and AX' = c(m).
c(ms) and c(mx) are the c's derived for the
two criteria; both reduce to c(m) in the special
case of equal dispersion matrices. c(m) is the
population analogue of the c used in practice
and is easier to compute than are c(ms) and
c(mx). Since c(mx) and c(ms) each minimize
one criterion, the comparisons will be to find
the conditions under which c(m) leads to both
smaller $P_x$ than does c(ms) and smaller $P_s$ than
does c(mx). When these conditions are satisfied
then c(m) may be regarded as a compromise
between c(ms) and c(mx).

The two essential parameters will be defined
as

(5.1) $B = \sigma_2 / \sigma_1$

(5.2) $C = \frac{m_2 - m_1}{\sigma_2 + \sigma_1}$

It can be seen that $B \geq 1$ and $C > 0$. If results
in B and C should be tabulated, the tables would
be symmetric in $\log B$, $-\log B$, and in $C$, $-C$.

Condition for reasonable linear discrimination.
Under certain conditions, linear discrimination
does not yield good results; an example of this
is the situation in which the centroids of the
two populations are the same. Any description
of the conditions necessary for linear
discrimination to be able to lead to reasonable results must
be, to some extent, arbitrary. Generally, the
situations in which linear discrimination may be
rejected are typified by no root of c(ms) being
contained in $(m_1, m_2)$.

At level (d) there are always two real
solutions of (3.3). By restricting our interest to the
range $(m_1, m_2)$ it follows from considerations
of monotonicity, continuity, and limiting
behavior that a necessary and sufficient condition
for the existence of a root of (3.3) in this range
is

(5.3) $\frac{1}{\sigma_1 \sqrt{2\pi}} \exp\left[-\frac{(m_2 - m_1)^2}{2\sigma_1^2}\right] < \frac{1}{\sigma_2 \sqrt{2\pi}}$

since the left and right sides of the inequality
are the densities of populations I and II at $m_2$.
(5.3) may be rewritten in terms of B and C
as follows:

(5.4) $C^2 > 2(B + 1)^{-2} \ln B$
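
The equivalence of the density form of the condition and its form in B and C can be checked numerically. The following Python sketch is illustrative only; the function names are hypothetical, and the common factor $1/\sqrt{2\pi}$ is cancelled from both densities.

```python
import math


def shape_parameters(m1, s1, m2, s2):
    """Return (B, C) as defined in (5.1) and (5.2)."""
    return s2 / s1, (m2 - m1) / (s2 + s1)


def condition_5_4(B, C):
    """True when a root of the minisum equation lies between the means."""
    return C * C > 2.0 * math.log(B) / (B + 1.0) ** 2


def density_condition(m1, s1, m2, s2):
    """Density of population I at m2 less than density of population II at m2."""
    left = math.exp(-0.5 * ((m2 - m1) / s1) ** 2) / s1
    right = 1.0 / s2
    return left < right
```

For means 0 and 3 with standard deviations 1 and 2, B = 2 and C = 1, and both forms of the condition hold; for means 0 and 1.1 with standard deviations 1 and 10, both fail.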
Fig. 1. Four regions in (B, C) corresponding to the
properties: (1) no linear discriminator reasonable;
(2) c(m) is a compromise between c(ms), c(mx);
(3) c(ms) is better than c(m); (4) both c(mx),
c(ms) are better than c(m). In general, the larger the
C, the stronger will be the discriminator.

The lower curve in Figure 1 separates the
region in (B, C) for which (5.4) is true from that
for which it is untrue. Thus in region 1 a quadratic discriminator
is appropriate; elsewhere a linear discriminator
is appropriate.

Investigation of when c(ms) is better than
c(m). Let us denote the larger conditional
probability of misclassification, $P_x$, using c(m),
c(ms) by $P_x(m)$, $P_x(ms)$ respectively.

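Now c(mx) is the point at which the
probabilities of misclassification are
equal, so that a c < c(mx) indicates $P_I = P_x$
and a c > c(mx) indicates $P_{II} = P_x$. Further,
$m_2 \geq m_1$, $\sigma_2 \geq \sigma_1$ imply that both c(m) and
c(ms) are greater than c(mx) since:

(5.5.a) $c(m) - c(mx) = \frac{(m_2 - m_1)(\sigma_2 - \sigma_1)}{2(\sigma_1 + \sigma_2)}$

(5.5.b) $c(ms) - c(mx) = \frac{m_1 \sigma_2^2 - m_2 \sigma_1^2 + \sigma_1 \sigma_2 \left[(m_2 - m_1)^2 + 2(\sigma_2^2 - \sigma_1^2) \ln \frac{\sigma_2}{\sigma_1}\right]^{1/2}}{\sigma_2^2 - \sigma_1^2} - \frac{m_1 \sigma_2 + m_2 \sigma_1}{\sigma_1 + \sigma_2}$

$= \frac{\sigma_1 \sigma_2}{\sigma_2^2 - \sigma_1^2} \left\{ m_1 - m_2 + \left[(m_2 - m_1)^2 + 2(\sigma_2^2 - \sigma_1^2) \ln \frac{\sigma_2}{\sigma_1}\right]^{1/2} \right\}$

Therefore, $P_x(m) = P_{II}(m)$ and $P_x(ms) = P_{II}(ms)$.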


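It follows immediately that a necessary and
sufficient condition for $P_x(m) > P_x(ms)$ is
c(m) > c(ms),

i.e., $\frac{(m_2 - m_1)(\sigma_2 - \sigma_1)}{2(\sigma_1 + \sigma_2)} > \frac{\sigma_1 \sigma_2}{\sigma_2^2 - \sigma_1^2} \left\{ m_1 - m_2 + \left[(m_2 - m_1)^2 + 2(\sigma_2^2 - \sigma_1^2) \ln \frac{\sigma_2}{\sigma_1}\right]^{1/2} \right\}$,

i.e., $(m_2 - m_1) \frac{\sigma_1^2 + \sigma_2^2}{2 \sigma_1 \sigma_2} > \left[(m_2 - m_1)^2 + 2(\sigma_2^2 - \sigma_1^2) \ln \frac{\sigma_2}{\sigma_1}\right]^{1/2}$,

which may be rewritten as:

(5.6) $C^2 (B^2 - 1) - \frac{8 B^2 \ln B}{(B + 1)^2} > 0$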

The center curve in Figure 1 separates the
regions of (B, C) for which $P_x$ using AX' =
c(ms) is greater, less than that using AX' =
c(m). Thus in regions 1 and 2, c(m) is better
with respect to the minimax criterion; in
regions 3 and 4, c(ms) is better with respect to
the minimax criterion.
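
Condition (5.6), written as $C^2(B^2 - 1) - 8B^2 \ln B / (B + 1)^2 > 0$, can be compared directly with the ordering of c(m) and c(ms). The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the populations are rescaled so that $m_1 = 0$ and $\sigma_1 = 1$ (which leaves B and C unchanged), and B > 1 is required to avoid division by zero.

```python
import math


def condition_5_6(B, C):
    """True when c(m) exceeds c(ms), so that c(ms) has the smaller P_x."""
    return C * C * (B * B - 1.0) - 8.0 * B * B * math.log(B) / (B + 1.0) ** 2 > 0.0


def c_values_level_d(B, C):
    """Return (c(m), c(ms)) at level (d) with m1 = 0, sigma1 = 1, B > 1."""
    m2, s2 = C * (B + 1.0), B
    # Positive-root minisum constant with alpha = beta = p = q = 1.
    root = math.sqrt(m2 * m2 + 2.0 * (s2 * s2 - 1.0) * math.log(s2))
    c_ms = (-m2 + s2 * root) / (s2 * s2 - 1.0)
    return 0.5 * m2, c_ms
```

For B = 2 the condition holds at C = 1 and fails at C = 0.8, in agreement with the direct comparison of the two constants.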

Investigation of when c(mx) is better than
c(m). Let us denote the sum of conditional
probabilities of misclassification, $P_s$, using c(m),
c(mx) by $P_s(m)$, $P_s(mx)$.

On expressing $P_I$, $P_{II}$ in terms of c(m),
c(mx) and hence in terms of B, C, it follows,
after rearrangement, that

$P_s(m) - P_s(mx) = 2 \int_0^{C} N(0, 1)\, dx - \int_0^{C(B+1)/2} N(0, 1)\, dx - \int_0^{C(B+1)/(2B)} N(0, 1)\, dx = g(B, C)$


say. From differential-geometrical considerations
and the fact that both c(m), c(ms) are greater
than c(mx), it follows that c(m) < c(ms)
implies that $P_s(m) < P_s(mx)$. The upper
curve in Figure 1 is the curve g(B, C) = 0,
which separates the regions of (B, C) for which
the sum of conditional probabilities of
misclassification using AX' = c(mx) is greater,
less than that using AX' = c(m). Thus in
region 4, c(mx) is better with respect to the
minisum criterion; elsewhere c(m) is better
with respect to the minisum criterion. The
asymptote as B tends to infinity is,
approximately, C = 1.029.
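
The function g and its asymptote can be evaluated with the error function. The Python sketch below is an illustration, not the authors' computation: it assumes g(B, C) is the sum of misclassification probabilities using c(m) minus the same sum using c(mx), written as $2I(C) - I(C(B+1)/2) - I(C(B+1)/(2B))$ where $I(x)$ is the standard normal integral from 0 to x, and it locates the asymptote by bisection on the limit of g as B tends to infinity. All names are hypothetical.

```python
import math


def normal_integral_from_zero(x):
    """Integral of the standard normal density from 0 to x."""
    return 0.5 * math.erf(x / math.sqrt(2.0))


def g(B, C):
    """P_s using c(m) minus P_s using c(mx), in terms of B and C."""
    integral = normal_integral_from_zero
    upper_one = 0.5 * C * (B + 1.0)
    upper_two = 0.5 * C * (B + 1.0) / B
    return 2.0 * integral(C) - integral(upper_one) - integral(upper_two)


def asymptote_in_C(iterations=100):
    """Root in C of the limit of g(B, C) as B tends to infinity."""
    integral = normal_integral_from_zero

    def limit_g(C):
        # The first upper limit tends to infinity, the second to C / 2.
        return 2.0 * integral(C) - 0.5 - integral(0.5 * C)

    lo, hi = 0.5, 2.0  # limit_g is negative at 0.5 and positive at 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if limit_g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Negative values of g indicate that c(m) gives the smaller sum, positive values that c(mx) does. Under these assumptions the bisection returns a value close to the asymptote quoted in the text.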

Figure 2 shows g(B, C) plotted against C
for several values of B.

Fig. 2. Difference between (a) the sum of
conditional probabilities of misclassification using c(m),
and (b) the same using c(mx), expressed as a
function of C for several values of B.